library(openintro)
source("covid_study_data_plotter.R")
## [1] "~~~~~ LOADING DATA ~~~~~~"
## [1] "Founds cached data folder."
## [1] "HRR shapes data loaded from local cache."
## [1] "Loaded zip hrr crosswalk data from local cache."
## [1] "Loaded ZCTA codes and population data from local cache."
## [1] "Approximated zip code population using ZCTA populations: "
## [1] "--After joining US state zip codes with ZCTAs, We keep about 98.8 % of the population, loosing 3,738,434 people."
## [1] "--zctas also account for the population in us territories, which have a total population of 3,623,895 according to wikipedia."
## [1] "--This pushes the retained popoulation of the states closer to 99.96 percent"
View the Shiny App here: https://cinderscript.shinyapps.io/Covid-Vaccination-and-Hospital-Strain/
This project is available on GitHub
This is a study of vaccination rates and hospital bed usage in the United States, inspired by the Washington Post article “Mapping America’s hospitalization and vaccination divide.” In this project we recreate the USA map found in the article and add interactivity so that different dates and variables can be selected.
Washington Post Bivariate Choropleth Map
Vaccination data is provided at the county level by the CDC and hospital bed usage data is provided by HealthData.gov. These variables are visualized with a bivariate choropleth map of Hospital Referral Regions in the United States.
Vaccination rate is defined by county, while HRRs are defined by zip code. We can’t use county data directly to calculate the vaccination rate of an HRR because these regions overlap: zip codes in one HRR can lie in different counties, and zip codes in different counties can lie in the same HRR.
We need to know both the population of each HRR and the vaccination rate of each part of that population.
We determine the vaccination rate of each HRR by averaging the vaccination rates (given by county) of the zip codes in that HRR, weighting each zip code’s rate by that zip code’s population. Population data is obtained from the United States Census Bureau, which unfortunately does not count population by zip code but by census blocks. To estimate the population of zip codes, we use the 2010 Census ZIP Code Tabulation Area (ZCTA) records, which approximate the zip codes in which those blocks lie.
To find all zip codes in an HRR we will use a Zip Code to HRR crosswalk.
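The weighting described above can be sketched in base R as follows. This is a minimal illustration rather than the project's actual `calculate_hrr_vaccination_rates()` code: the data frames, their column names (`zip`, `hrr`, `vacc_rate`, `population`), and the helper `hrr_vaccination_rate()` are all hypothetical, and the values are made up.

```r
# Toy zip-to-HRR crosswalk (hypothetical values).
crosswalk <- data.frame(
  zip = c("00001", "00002", "00003", "00004"),
  hrr = c("A", "A", "B", "B"),
  stringsAsFactors = FALSE
)

# Toy zip-level table: each zip code carries its county's vaccination
# rate and an estimated population (hypothetical values).
zip_stats <- data.frame(
  zip        = c("00001", "00002", "00003", "00004"),
  vacc_rate  = c(40, 60, 50, 70),
  population = c(1000, 3000, 2000, 2000),
  stringsAsFactors = FALSE
)

# Join zip codes to their HRR, then take the population-weighted mean
# of the zip-level vaccination rates within each HRR.
hrr_vaccination_rate <- function(crosswalk, zip_stats) {
  joined <- merge(crosswalk, zip_stats, by = "zip")
  by_hrr <- split(joined, joined$hrr)
  data.frame(
    hrr = names(by_hrr),
    vacc_rate = vapply(
      by_hrr,
      function(d) weighted.mean(d$vacc_rate, d$population),
      numeric(1),
      USE.NAMES = FALSE
    ),
    population = vapply(
      by_hrr,
      function(d) sum(d$population),
      numeric(1),
      USE.NAMES = FALSE
    ),
    row.names = NULL
  )
}

hrr_rates <- hrr_vaccination_rate(crosswalk, zip_stats)
print(hrr_rates)
```

In this toy data, HRR "A" is dominated by its larger zip code, so its rate (55) sits closer to 60 than the unweighted mean of 50 would.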
All data sources are downloaded from the internet, and only the needed portions of each dataset are requested from the corresponding endpoints. Before downloading, the application first checks whether the requested dataset (for the requested date, where applicable) already exists in the local “cached-data” folder. Whenever a new file or a new portion of a dataset is downloaded, it is saved to that cache folder inside the application folder.
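A check-the-cache-first pattern like the one described can be sketched as below. The function name `load_with_cache()`, the `fetch_fn` argument, and the `.rds` file naming scheme are assumptions made for illustration; the project's loader may organize its cache differently.

```r
# Return the cached copy of a dataset if one exists; otherwise fetch it,
# save it to the cache folder, and return it.
# `fetch_fn` is a stand-in for whatever download routine is used
# (hypothetical); the file naming scheme is also an assumption.
load_with_cache <- function(name, date, fetch_fn, cache_dir = "cached-data") {
  if (!dir.exists(cache_dir)) {
    dir.create(cache_dir, recursive = TRUE)
  }
  path <- file.path(cache_dir, sprintf("%s_%s.rds", name, date))
  if (file.exists(path)) {
    return(readRDS(path))  # cache hit: no download needed
  }
  data <- fetch_fn(date)   # cache miss: download ...
  saveRDS(data, path)      # ... and store for next time
  data
}
```

Calling `load_with_cache("vacc", "2021-09-23", some_fetch_function)` twice would invoke the download routine only on the first call.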
Vaccination rates are obtained from the CDC: https://data.cdc.gov/Vaccinations/COVID-19-Vaccinations-in-the-United-States-County/8xkx-amqh. This dataset is large, so instead of downloading the whole dataset, this application accesses it through the SODA API and retrieves only the relevant rows and columns.
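To illustrate retrieving only selected rows and columns, the sketch below builds a SODA query URL using the `$select`, `$where`, and `$limit` query parameters. It assumes the standard Socrata resource endpoint pattern for the dataset id `8xkx-amqh`; the helper `build_soda_url()` is hypothetical, and the column names in the example query are assumptions that should be checked against the dataset's documentation.

```r
# Build a SODA query URL that asks the server for specific columns
# ($select) and rows ($where) instead of the whole dataset.
build_soda_url <- function(base, select, where, limit = 1000L) {
  query <- c(
    `$select` = paste(select, collapse = ","),
    `$where`  = where,
    `$limit`  = sprintf("%d", as.integer(limit))
  )
  # Percent-encode each value, including reserved characters.
  encoded <- vapply(query, utils::URLencode, character(1), reserved = TRUE)
  paste0(base, "?", paste0(names(query), "=", encoded, collapse = "&"))
}

url <- build_soda_url(
  base   = "https://data.cdc.gov/resource/8xkx-amqh.csv",  # assumed endpoint pattern
  select = c("date", "fips", "series_complete_pop_pct"),   # assumed column names
  where  = "date = '2021-09-23T00:00:00.000'",
  limit  = 1000L
)
print(url)

# With network access, the filtered rows could then be read with:
# vaccinations <- read.csv(url)
```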
The “COVID-19 Vaccinations in the United States,County” data provides counts and percentages of people who have been vaccinated in each county of the United States.
The variables retrieved are:
Hospital bed usage counts are obtained from HealthData.gov: https://healthdata.gov/Hospital/COVID-19-Reported-Patient-Impact-and-Hospital-Capa/anag-cw7u. This dataset is large, so instead of downloading the whole dataset, this application accesses it through the SODA API and retrieves only the relevant rows and columns.
The “COVID-19 Reported Patient Impact and Hospital Capacity by Facility” data provides counts of hospital bed utilization, aggregated weekly.
The variables retrieved are:
Inpatient bed counts are used instead of total bed counts because many hospitals report inpatient bed data but do not report combined inpatient and outpatient totals.
Population data is obtained from the United States Census Bureau using their 2010 ZCTA to County Relationship File (zcta_county_rel_10.txt).
download: https://www2.census.gov/geo/docs/maps-data/data/rel/zcta_county_rel_10.txt
column descriptions: https://www.census.gov/programs-surveys/geography/technical-documentation/records-layout/2010-zcta-record-layout.html#par_textimage_0
County shape data is obtained from the R package albersusa.
Shape data for USA HRRs is downloaded from arcgis.com using their “FeatureServer” REST API.
https://www.arcgis.com/home/item.html?id=46bf6790c4e0455e9379ee9769b1a5ab
The Zip Code to HRR crosswalk is obtained from the Dartmouth Atlas as a zip file.
We also calculate:
The hospital bed dataset contains records for each hospital. For mapping, each map region will need to represent the average of all hospitals present in that region.
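One way to roll hospital records up to one value per map region is sketched below. It pools bed counts within each HRR and then takes the ratio, which is only one possible reading of "average"; a mean of per-hospital ratios would give different numbers. The column names `inpatient_beds_used_covid` and `inpatient_beds`, the helper `aggregate_beds_by_hrr()`, and the data are hypothetical; `covid_bed_usage_ratio` matches the variable name used in the plotting calls.

```r
# Toy hospital-level records, already tagged with their HRR
# (hypothetical column names and values).
hospitals <- data.frame(
  hrr                       = c("A", "A", "B"),
  inpatient_beds_used_covid = c(10, 90, 5),
  inpatient_beds            = c(100, 300, 50),
  stringsAsFactors = FALSE
)

# Sum bed counts per HRR, then compute the pooled usage ratio.
# Note: the formula interface of aggregate() drops rows with NA values.
aggregate_beds_by_hrr <- function(hospitals) {
  totals <- aggregate(
    cbind(inpatient_beds_used_covid, inpatient_beds) ~ hrr,
    data = hospitals,
    FUN = sum
  )
  totals$covid_bed_usage_ratio <-
    totals$inpatient_beds_used_covid / totals$inpatient_beds
  totals
}

hrr_beds <- aggregate_beds_by_hrr(hospitals)
print(hrr_beds)
```

For HRR "A" the pooled ratio is 100 / 400 = 0.25, whereas the unweighted mean of the two hospital ratios (0.1 and 0.3) would be 0.2; pooling lets larger hospitals count for more.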
The function that calculates the vaccination rates of HRRs needs a dataset that tracks what percentage of each ZCTA’s population lies in each county. This dataset is generated by joining ZCTA populations with zip codes. The partial populations of ZCTAs in a given zip code (due to overlapping ZCTA and zip code regions) are accounted for.
Processed by Source: covid_study_data_wrangler.R, which creates functions for generating graph-able datasets
We are using population data for ZIP Code Tabulation Areas (ZCTAs) because we don’t have population counts for zip codes, so ZCTAs serve as our proxy for zip codes. Because ZCTAs don’t line up exactly with zip codes, the same ZCTA can be part of more than one county. That is why it is important to know how much of each ZCTA’s population is in each county.
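As a rough sketch of why the per-county share matters, the example below splits each ZCTA's population across the counties it touches and uses those shares to blend county vaccination rates into one ZCTA-level rate. The column names `ZCTA5`, `COUNTY`, and `POPPT` are modeled on the Census relationship file but should be treated as assumptions here, and all values are made up.

```r
# Toy relationship records: one row per (ZCTA, county) overlap, with the
# population of that overlap (hypothetical values).
zcta_county <- data.frame(
  ZCTA5  = c("10001", "10001", "10002"),
  COUNTY = c("001", "003", "001"),
  POPPT  = c(750, 250, 500),
  stringsAsFactors = FALSE
)

# Share of each ZCTA's population that lives in each county.
zcta_total <- ave(zcta_county$POPPT, zcta_county$ZCTA5, FUN = sum)
zcta_county$share <- zcta_county$POPPT / zcta_total

# Toy county vaccination rates (hypothetical values).
county_rates <- data.frame(
  COUNTY    = c("001", "003"),
  vacc_rate = c(40, 80),
  stringsAsFactors = FALSE
)

# Blend county rates into a ZCTA rate using the population shares.
joined <- merge(zcta_county, county_rates, by = "COUNTY")
zcta_rate <- tapply(joined$share * joined$vacc_rate, joined$ZCTA5, sum)
print(zcta_rate)
```

ZCTA 10001 has 75% of its population in county 001 and 25% in county 003, so its blended rate is 0.75 * 40 + 0.25 * 80 = 50.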
The Process:
Function defined in Source:
covid_study_data_wrangler.R :: calculate_hrr_vaccination_rates(date)
There is a lot of missing data for both hospital bed usage and vaccination rates. Texas records before 2021-10-22 are removed: before that date, Texas had problems with its recorded vaccination rates, which appear as ‘0’ in the dataframes. Removing these records keeps them from skewing the percentages of the graphed statistics and the automatic range scaling of the graphs.
Information: https://www.texastribune.org/2021/01/20/texas-coronavirus-vaccine-data/
Replace 0% single dose percentages with NA where appropriate:
Many records have a single dose percentage of 0 while the series complete percentage is greater than 0. This isn’t possible, so these values were probably not recorded. They are replaced with NA so that later data wrangling ignores them in calculations.
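A minimal sketch of this replacement rule is shown below. The column names `single_dose_percent` and `vacc_complete_percent` are borrowed from the plotting calls in this document; the helper `mark_unrecorded_single_dose()` and the sample data are hypothetical.

```r
# Toy vaccination records (hypothetical values).
vacc <- data.frame(
  single_dose_percent   = c(0, 0, 55, 0),
  vacc_complete_percent = c(40, 0, 45, NA)
)

# A record with 0% single dose but a positive series-complete percentage
# is impossible, so treat the single dose value as unrecorded (NA).
mark_unrecorded_single_dose <- function(df) {
  impossible <- !is.na(df$single_dose_percent) &
    !is.na(df$vacc_complete_percent) &
    df$single_dose_percent == 0 &
    df$vacc_complete_percent > 0
  df$single_dose_percent[impossible] <- NA
  df
}

cleaned <- mark_unrecorded_single_dose(vacc)
print(cleaned)
```

Only the first row is changed: rows where both values are 0, or where the series-complete value is itself missing, are left as they are.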
By 2021-02-01, every state with HRRs reports some single dose percentage; TX is the only state still at 0%, and we have no collected data for it. After 2021-01-31, all entries that have a 0% single dose percentage are removed. (All entries that are either before 2021-02-01 or have a nonzero single dose percentage are kept.)
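The date-based rule above could be expressed roughly as follows. The `date` column name and the helper `filter_zero_single_dose()` are hypothetical, and keeping rows whose single dose value is NA is an assumption of this sketch, since the text only speaks of 0% entries.

```r
# Toy records spanning the cutoff date (hypothetical values).
records <- data.frame(
  date = as.Date(c("2021-01-15", "2021-02-01", "2021-03-01", "2021-03-01")),
  single_dose_percent = c(0, 0, 30, NA)
)

# Drop entries on or after the cutoff whose single dose percentage is
# exactly 0. Earlier entries are kept, and NA values are kept as well
# (an assumption of this sketch).
filter_zero_single_dose <- function(df, cutoff = as.Date("2021-02-01")) {
  drop <- df$date >= cutoff &
    !is.na(df$single_dose_percent) &
    df$single_dose_percent == 0
  df[!drop, , drop = FALSE]
}

filtered <- filter_zero_single_dose(records)
print(filtered)
```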
covid_study_data_loader.R | loads all required data
covid_study_data_wrangler.R | functions for generating graph-able datasets
covid_study_data_plotter.R | functions for generating ggplot and ggplotly graphs
app.R | shiny app
Source Dependency:
app.R –> covid_study_data_plotter.R –> covid_study_data_wrangler.R –> covid_study_data_loader.R
graph_plotly_point_plot(
  Graph_Vaccination_Hospitalization_Plot("2021-09-23", x_axis = "vacc_complete_percent", y_axis = "covid_bed_usage_ratio")
)
## [1] "Vaccination records loaded from local cache."
## [1] "Hospital bed data loaded from local cache."
## Warning in geom_point(aes(color = population, text = text), size = 1.2):
## Ignoring unknown aesthetics: text
## `geom_smooth()` using formula = 'y ~ x'
Graph_Vaccination_Rates_Choropleth_By_Hrr_Static("2021/09/24", display_stat = "vacc_complete_percent")
## [1] "Vaccination records loaded from local cache."
graph_plotly_vacc_choropleth(
  Graph_Vaccination_Rates_Choropleth_By_Hrr("2021/09/24", display_stat = "single_dose_percent", is_scale_adaptive = FALSE)
)
## [1] "Vaccination records loaded from local cache."
## Warning in layer_sf(geom = GeomSf, data = data, mapping = mapping, stat =
## stat, : Ignoring unknown aesthetics: text